Session 10 Solutions: Statistics and modeling

Note: There are often multiple ways to answer each question.

Install and load the caret package. Also load the ggplot2 and dplyr packages. Load the Sacramento dataset with data(Sacramento). Plot a histogram of the house prices with 50 bins.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data(Sacramento)
ggplot(Sacramento, aes(x = price)) +
    geom_histogram(bins = 50)

Make a violin plot of price for each level of type. Overlay this with a scatterplot with the same x and y aesthetics, but define the col by factor(beds) and introduce jitter and alpha for the points. Make some observations.

There are many more residential homes compared to condos and multi-family homes. Conds seem to be the cheapest on average. Multi-family homes and residential homes seem to be a bit more expensive on average, but there is a lot more variance in residential home prices.

ggplot(Sacramento, aes(x = type, y = price)) +
    geom_violin() +
    geom_point(aes(col = factor(beds)), position = "jitter", alpha = 0.5)

Do a $t$-test at the 95% level to see if the mean price for multi-family homes is statistically different from that for residential homes. (The test should be two-sided.)

The code below shows that the $p$-value is 0.26, which means that we don’t have enough evidence to reject the null hypothesis of the means being the same.

mf_data <- Sacramento[Sacramento$type == "Multi_Family",]$price
rs_data <- Sacramento[Sacramento$type == "Residential",]$price
t.test(mf_data, rs_data)

## 
##  Welch Two Sample t-test
## 
## data:  mf_data and rs_data
## t = -1.1849, df = 12.884, p-value = 0.2574
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -80384.82  23472.20
## sample estimates:
## mean of x mean of y 
##  224534.7  252991.0

Instead of doing a $t$-test, do a Kolmogorov-Smirnov test at the 95% level to see if the distribution of multi-family homes is statistically different from that for residential homes. (Again, the test should be two-sided.)

The $p$-value is 0.61, which means that we don’t have enough evidence to reject the null hypothesis of the distributions being the same.

ks.test(mf_data, rs_data)

## Warning in ks.test(mf_data, rs_data): p-value will be approximate in the
## presence of ties

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  mf_data and rs_data
## D = 0.21292, p-value = 0.607
## alternative hypothesis: two-sided

Make a boxplot of price for each level of beds. Make another boxplot of price for each level of baths. Do you see a trend? (Note: You will want to plot factor(beds) and factor(baths), as plotting separate boxplots for different values of a numeric variable doesn’t work.)

The more baths or beds, the higher the price of the home.

ggplot(Sacramento, aes(x = factor(beds), y = price)) +
    geom_boxplot()

ggplot(Sacramento, aes(x = factor(baths), y = price)) +
    geom_boxplot()

Fit an additive linear model of price against beds and baths with no interaction term. Interpret the coefficients of the model.

The final model is $price = -3,679 + 22,163 \cdot beds + 88,6572 \cdot baths$. This suggests that adding one bedroom increases the price of the house by about $22k, while adding an additional bathroom increases the price of the house by about $88k.

fit <- lm(price ~ beds + baths, data = Sacramento)
summary(fit)

## 
## Call:
## lm(formula = price ~ beds + baths, data = Sacramento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315683  -67015  -20952   49048  540104 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3679      13455  -0.273    0.785    
## beds           22163       5203   4.260 2.26e-05 ***
## baths          86572       6394  13.540  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106300 on 929 degrees of freedom
## Multiple R-squared:  0.3442, Adjusted R-squared:  0.3428 
## F-statistic: 243.8 on 2 and 929 DF,  p-value: < 2.2e-16

Instead of fitting an additive linear model of price against beds and baths with no interaction term (as in Question 6), fit a linear model as above but with an interaction term.

fit <- lm(price ~ beds * baths, data = Sacramento)
summary(fit)

## 
## Call:
## lm(formula = price ~ beds * baths, data = Sacramento)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333773  -67872  -21501   49655  540493 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28511      33606   0.848    0.396    
## beds           12621      10507   1.201    0.230    
## baths          70584      16578   4.258 2.27e-05 ***
## beds:baths      4463       4269   1.045    0.296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106300 on 928 degrees of freedom
## Multiple R-squared:  0.3449, Adjusted R-squared:  0.3428 
## F-statistic: 162.9 on 3 and 928 DF,  p-value: < 2.2e-16

Drop the columns zip, latitude and longitude from the dataset and save the result to df.

df <- Sacramento %>% select(-zip, -latitude, -longitude)

We have many more columns of information in our dataset df that could help to predict price. Fit an additive linear model (with no interaction terms) of price against all other columns in df. (You will have to google around to find out how to do this!)

fit <- lm(price ~ ., data = df)
summary(fit)

## 
## Call:
## lm(formula = price ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -245958  -43670   -9165   29768  401858 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          23330.328  17515.030   1.332 0.183195    
## cityAUBURN          104667.280  34897.902   2.999 0.002782 ** 
## cityCAMERON_PARK     46072.765  26970.453   1.708 0.087935 .  
## cityCARMICHAEL       85439.901  20345.056   4.200 2.94e-05 ***
## cityCITRUS_HEIGHTS    9665.403  17466.514   0.553 0.580151    
## cityCOOL             93831.945  72605.464   1.292 0.196570    
## cityDIAMOND_SPRINGS  30514.087  72616.662   0.420 0.674435    
## cityEL_DORADO        49692.662  52187.963   0.952 0.341261    
## cityEL_DORADO_HILLS 106484.720  20120.488   5.292 1.52e-07 ***
## cityELK_GROVE        10055.055  14184.769   0.709 0.478595    
## cityELVERTA         -39524.939  37939.080  -1.042 0.297787    
## cityFAIR_OAKS        75469.590  26936.592   2.802 0.005193 ** 
## cityFOLSOM          128168.817  21454.619   5.974 3.35e-09 ***
## cityFORESTHILL       18242.664  72625.771   0.251 0.801727    
## cityGALT              6525.733  19985.347   0.327 0.744103    
## cityGARDEN_VALLEY   157438.155  72835.709   2.162 0.030919 *  
## cityGOLD_RIVER       82640.091  38107.512   2.169 0.030377 *  
## cityGRANITE_BAY     302694.271  43417.787   6.972 6.10e-12 ***
## cityGREENWOOD         6146.228  73102.612   0.084 0.933014    
## cityLINCOLN           3935.477  19873.987   0.198 0.843073    
## cityLOOMIS          373492.368  52402.185   7.127 2.11e-12 ***
## cityMATHER          -52016.806  72700.467  -0.715 0.474491    
## cityMEADOW_VISTA     62105.143  72769.260   0.853 0.393638    
## cityNORTH_HIGHLANDS -26321.587  20156.623  -1.306 0.191940    
## cityORANGEVALE       55858.065  24920.944   2.241 0.025245 *  
## cityPENRYN          318538.623  72614.473   4.387 1.29e-05 ***
## cityPLACERVILLE     110008.457  25876.561   4.251 2.35e-05 ***
## cityPOLLOCK_PINES    46047.983  43199.542   1.066 0.286741    
## cityRANCHO_CORDOVA   -7441.958  18438.617  -0.404 0.686599    
## cityRANCHO_MURIETA  -30755.686  43311.085  -0.710 0.477821    
## cityRIO_LINDA        -2378.510  23533.040  -0.101 0.919517    
## cityROCKLIN          88921.249  21465.959   4.142 3.76e-05 ***
## cityROSEVILLE        62907.351  16279.920   3.864 0.000120 ***
## citySACRAMENTO        -629.545  12999.363  -0.048 0.961385    
## cityWALNUT_GROVE    144896.791  72882.393   1.988 0.047108 *  
## cityWEST_SACRAMENTO -17779.062  43314.173  -0.410 0.681562    
## cityWILTON          185864.842  35372.181   5.255 1.86e-07 ***
## beds                -21735.472   4349.106  -4.998 6.99e-07 ***
## baths                 6576.108   5427.047   1.212 0.225938    
## sqft                   131.523      6.359  20.684  < 2e-16 ***
## typeMulti_Family       987.870  23982.425   0.041 0.967153    
## typeResidential      43262.632  11147.855   3.881 0.000112 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71520 on 890 degrees of freedom
## Multiple R-squared:  0.7156, Adjusted R-squared:  0.7025 
## F-statistic: 54.63 on 41 and 890 DF,  p-value: < 2.2e-16

Make a scatterplot of price against sqft and include the linear model fit on the plot (without standard error intervals). Make the title and axis labels informative. Fit the model as well and look at the R-squared value (how well sqft predicts price).

ggplot(df, aes(x = sqft, y = price)) +
    geom_point(alpha = 0.2) + 
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Price vs. Square Footage", x = "Sq. Ft.", y = "Price")

fit <- lm(price ~ sqft, data = df)
summary(fit)

## 
## Call:
## lm(formula = price ~ sqft, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -231889  -54717  -11822   38993  600141 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13859.393   6948.714   1.995   0.0464 *  
## sqft          138.546      3.796  36.495   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84130 on 930 degrees of freedom
## Multiple R-squared:  0.5888, Adjusted R-squared:  0.5884 
## F-statistic:  1332 on 1 and 930 DF,  p-value: < 2.2e-16

It is worth noting that the model with sqft alone has $R^2 = 0.59$, while the model with all the variables thrown into it had a modestly better $R^2 = 0.72$. This suggests that sqft is the single driving factor of price.

Get the median house price for each city. Plot it on a map, with each city depicted by a point whose size is determined by the number of houses, and color determined by the median price.

city_df <- Sacramento %>% 
    group_by(city) %>%
    summarize(houses = n(), price = median(price),
              lat = mean(latitude), long = mean(longitude))
ggplot(city_df, aes(x = long, y = lat)) +
    geom_point(aes(col = price, size = houses)) +
    scale_color_distiller(palette = "YlOrRd")

It would be even more informative if we could have the base map of Sacramento! Unfortunately, the main package used to do this (ggmap) seems a difficult to use at the moment. We make do with the county map from the maps package.

library(ggrepel)
library(maps)
county_df <- map_data("county") %>%
    filter(region == "california")
ggplot(mapping = aes(x = long, y = lat)) +
    geom_polygon(data = county_df, mapping = aes(group = group), 
                 fill = "black", col = "white") +
    geom_point(data = city_df, aes(col = price, size = houses)) +
    geom_text_repel(data = city_df, aes(label = city, col=price), size = 3) +
    scale_color_distiller(palette = "Spectral") +
    coord_quickmap(xlim = c(-121.6, -120.3), ylim = c(38.2, 39.1))

Session 10 Solutions: Statistics and modeling

Kenneth Tay

Oct 24, 2019